Proteomics Standards Initiative’s ProForma 2.0: Unifying the Encoding of Proteoforms and Peptidoforms

It is important for the proteomics community to have a standardized manner to represent all possible variations of a protein or peptide primary sequence, including natural, chemically-induced and artifactual modifications. The Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI) in collaboration with several members of the Consortium for Top-Down Proteomics (CTDP) has developed a standard notation called ProForma 2.0, which is a substantial extension of the original ProForma notation developed by the CTDP. ProForma 2.0 aims to unify the representation of proteoforms and peptidoforms. ProForma 2.0 supports use cases needed for bottom-up and middle-/top-down proteomics approaches and allows the encoding of highly modified proteins and peptides using a human- and machine-readable string. ProForma 2.0 can be used to represent protein modifications in a specified or ambiguous location, designated by mass shifts, chemical formulas, or controlled vocabulary terms, including cross-links (natural and chemical), and atomic isotopes. Notational conventions are based on public controlled vocabularies and ontologies. The most up-to-date full specification document and information about software implementations are available at http://psidev.info/proforma.

Appendix I. Levels of Compliance
Appendix II. Extensions to improve the representation of PSMs in mass spectra
7.1 Representation of the ion charges
7.2 Representation of multiple peptidoform assignments in chimeric spectra
Appendix III. Glossary of terms used in the specification

Description of the need
Protein and peptide sequences are usually represented as a string of amino acids in the well-known one-letter code endorsed by the IUPAC (see e.g. https://wissen.science-and-fun.de/chemistry/biochemistry/iupac-one-letter-codes-for-bioinformatics/). How to represent all the possible variations of a protein or peptide primary structure, including both artefactual and post-translational modifications (PTMs), is less clear. For example, the Consortium for Top-Down Proteomics (CTDP) has introduced a standard proteoform notation format called ProForma [1,2] for writing the primary structures of fully characterized proteoforms [3]. Proteoforms comprise protein species that include variations arising from genetic, transcriptomic, translational, post-translational, and artefactual (e.g., during sample processing) sources. ProForma specifically focuses on representing post-translational modifications of endogenous and artefactual sources.
Briefly, ProForma describes proteoforms as the amino acid sequences (the one-letter code representation) complemented with information on any modifications (of a known identity or via unidentified mass shifts) given in brackets following certain amino acids.
Despite its suitability to support a wide range of possible use cases, the original ProForma notation had some limitations. Additionally, the Proteomics Standards Initiative (PSI) has developed a format called PEFF (PSI Extended FASTA Format, http://www.psidev.info/peff) [4]. Although PEFF's primary intended use is for representing search databases for optimising proteomics analyses, PEFF can also be used to represent proteoforms [3] (see more details in Section 4). Therefore, there are multiple ways of encoding protein modifications, and extended discussion has taken place to achieve a consensus. A comprehensive standard notation for proteoforms, as well as for their peptidic counterparts, peptidoforms (a term introduced in [5]), is thus required by the community, so that it can enhance the current description or be newly embedded in many relevant PSI (and potentially other) file formats.
The format specification presented here, ProForma 2.0, represents the consensus between both groups, CTDP and PSI, for the enhanced standard representation of proteoforms and peptidoforms. Compared to the original ProForma notation, it aims to support a broader variety of peptidomics and proteomics approaches, including bottom-up (focused on peptides/peptidoforms) and middle/top-down (focused on proteins/proteoforms) approaches [6]. The name of the notation, ProForma 2.0, derives from the original ProForma notation introduced by CTDP. For simplicity, going forward we will refer to this extended notation as ProForma.

Requirements
ProForma 2.0 (Proteoform and Peptidoform Notation), February 3, 2022, http://psidev.info/proforma
The eight main requirements to be fulfilled by a proteoform and peptidoform notation are:
- It MUST be a string that is human readable, so it can be generally understood by human individuals.
- It MUST be machine parsable. Other variants of this notation will not be supported computationally, although they could be 'human readable.'
- It MUST be able to support the encoding of amino acid sequences and protein modifications.
- It MUST be able to support the main use cases needed by the proteomics community as a whole, including both bottom-up (focused on peptides/peptidoforms) and middle/top-down (focused on proteins/proteoforms) approaches.
- It MUST be flexible to accommodate different "flavours" of notations, considering common current use.
- It MUST be compatible with existing PSI file formats, where it could be used.
- It MUST be able to capture ambiguity in the position of the modified sites.
- It MUST be able to evolve, so new use cases can be added iteratively in the future.
Several of these requirements, particularly the first three, coincide with those of the original ProForma notation [2]. The fourth requirement was present in the ProForma notation description, but now includes support for the bottom-up proteomics-specific entities, i.e., peptides, whereas the original ProForma notation exclusively targeted whole proteoforms. The final four requirements are new.

Issues to be addressed
The main issues to be addressed by ProForma are:
- It MUST be able to represent peptidoforms and proteoforms in a consistent and reproducible way, considering the different ways of representing protein modifications.
- It MUST be able to be used jointly with the Universal Spectrum Identifier (USI), to represent peptide-spectrum matches (PSMs), and to represent proteoform-spectrum matches (PrSMs).

The documentation
The documentation of the ProForma notation for proteoforms and peptidoforms is divided into several components. All components in their most recent form are available at the HUPO-PSI website (http://psidev.info/proforma) and at the ProForma GitHub page (https://github.com/HUPO-PSI/ProForma/).
- Main specification document (this document).
- List of current implementations with examples.

Relationship to other specifications
The format specification described in this document is not being developed in isolation; indeed, it is designed to be complementary to, and thus used in conjunction with, several existing and emerging models. Related specifications include the following:
1. PSI Universal Spectrum Identifier (http://www.psidev.info/USI). The PSI Universal Spectrum Identifier is designed to provide a universal mechanism for referring to a specific spectrum in public repositories. It can optionally include an interpretation of the spectrum using the notation described in this specification. Displayers of USIs MAY use any of the supported ProForma notations.
2. mzSpecLib, the PSI spectrum library format (http://psidev.info/mzSpecLib). The PSI spectrum library format is being developed as a standard mechanism for storing spectrum libraries. Identified spectra of modified peptides will have to include the modification information, potentially in this ProForma notation. Furthermore, many spectrum library entries are derived from multiple spectra, and this provenance will be referenced using USIs.
3. PROXI (http://www.psidev.info/proxi). The Proteomics Expression Interface being developed by the PSI is a standardized API by which mass spectrometry proteomics information can be exchanged. References to individual spectra will be made via USIs.
4. PEFF (http://www.psidev.info/peff). Although it is not its main intended use, the PSI Extended FASTA Format enables the representation of proteoforms [4]. However, PEFF was not designed for the representation of the (potentially much shorter) peptidoforms. Additionally, PEFF 1.0 formally supports only a subset of the use cases outlined in this specification. Another key difference is that each proteoform instance in PEFF requires a FASTA header, whereas this is not required in ProForma.
5. ProForma (http://psidev.info/proforma). ProForma Proteoform Notation version 1, which enables the representation of proteoforms (https://topdownproteomics.github.io/ProteoformNomenclatureStandard/), developed by the CTDP [1]. This specification is subsumed by this new version 2 ProForma specification.

The Basic Form of the Proteoform and Peptidoform Notation
The ProForma notation is a string of characters that represent linearly one or more peptidoform/proteoform primary structures with possibilities to link peptidic chains together. It is not meant to represent higher order structures.
ProForma is case insensitive. However, within the data that follows the different keys, capitalisation may be important. In that case, capitalisation sensitivity is the decision of the supported CVs/ontologies.
Since ProForma MAY be used to represent both peptidoforms and proteoforms, there is currently no limit on its maximum length. Line breaks MUST NOT be used. Non-ASCII characters, however, are allowed, since they can be included in the supported ontologies and controlled vocabularies (CVs).
If implementers want to add any metadata (e.g. date of creation, software, version of ontologies, etc) to ProForma entities, the way to do it in this version would be to use the INFO tag.
Due to the multiple use cases supported in this specification, it is not expected that all implementers can support every ProForma feature. To facilitate adoption and separate some of the use cases, there are multiple "levels of compliance" and extensions for ProForma, which are summarised in Appendix I.

The canonical amino acid sequence
Amino acid sequences are represented as strings of characters using the one-letter code endorsed by the IUPAC (http://publications.iupac.org/pac/1984/pdf/5605x0595.pdf and https://wissen.science-and-fun.de/chemistry/biochemistry/iupac-one-letter-codes-for-bioinformatics/). There are also letters for representing ambiguous and/or unusual amino acids (see http://www.insdc.org/documents/feature_table.html#7.5.3), which are used in some UniProt entries. Some examples are:
- B: Aspartic Acid or Asparagine
- Z: Glutamic Acid or Glutamine
- J: Leucine or Isoleucine
- U: Selenocysteine
- O: Pyrrolysine
- X: Any amino acid (see also Section 4.2.6 Specifying a gap of known mass, for the use of X). We note that the character X itself is assigned zero mass in this notation.
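As a minimal sketch of how a reader might check the residue alphabet described above, the following Python snippet accepts the 20 standard one-letter codes plus the ambiguous or unusual letters listed here. The names `ALLOWED_RESIDUES` and `invalid_residues` are hypothetical, and the snippet assumes a bare sequence from which any modification notation has already been removed.

```python
# Standard one-letter codes plus the ambiguous/unusual letters listed above.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")
EXTENDED = set("BZJUOX")
ALLOWED_RESIDUES = STANDARD | EXTENDED


def invalid_residues(sequence):
    """Return (1-based position, character) for every unsupported character.

    Assumption: `sequence` is a bare amino acid string with no brackets.
    The comparison is done in upper case, since the notation is described
    as case insensitive.
    """
    return [
        (position, char)
        for position, char in enumerate(sequence.upper(), start=1)
        if char not in ALLOWED_RESIDUES
    ]
```

An empty list means every character belongs to the supported alphabet; anything else points at the offending positions.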
The representation of non-linear peptides is NOT formalised in this version of ProForma. See Section 5.3 ("Representation of cyclic peptides") in Section 5 (Pending issues) for possible ways to represent them.

Generic representation of protein modifications
It has been decided that multiple formats and reference systems must be supported, because some flexibility is required. The same approach is followed for both artefactual protein modifications and natural PTMs. Square brackets MUST be used to represent them when the position is unambiguous. They are located after the character representing the modified amino acid. If there is ambiguity in the position of the protein modification, different rules apply (see section 3.3.4).
- RESID (https://proteininformationresource.org/resid/). Although RESID is included in PSI-MOD, this reference system is still used in the top-down community.

Controlled vocabulary or ontology modification names
The names of CV or ontology terms MAY be used to represent protein modifications. The two main reference systems used are Unimod and PSI-MOD. To help readers differentiate between reference systems, names coming from the other three supported CVs/ontologies (RESID, XL-MOD and GNO) MUST be preceded by a letter and a colon indicating the originating CV/ontology. In the case of Unimod and PSI-MOD, the use of prefixes is optional.

Definition of the Unimod modification name
The Unimod OBO file SHOULD be used: http://www.unimod.org/obo/unimod.obo. Within this file, term names are found in the "name" tag. These terms are presented differently in the Unimod web interface (http://www.unimod.org/). There, the equivalent of the "name" field in the OBO file is the "PSI-MS Name" column when it has a value. If the "PSI-MS Name" field is empty, the "interim name" is used. Unimod synonyms are currently NOT supported, as they are provided inconsistently.

Controlled vocabulary or ontology protein modification accession numbers
When accession numbers from the supported CVs/ontologies are used to report protein modifications, full accession numbers MUST be used in all cases (no abbreviations in the names of the ontologies/CVs are allowed). The supported names are: UNIMOD, MOD, RESID, XLMOD and GNO.

Support for cross-linkers
Support for cross-linkers is possible by using the XL-MOD CV. It is acknowledged that the current version of ProForma does not provide support for all possible use cases involving cross-linked peptides. In the future, it is expected that a specific extension for this type of information can be developed.
Using the XL-MOD CV, crosslinked sites MUST be represented immediately following the modification notation using the prefix #XL, followed by an arbitrary label consisting of alphanumeric characters ([A-Za-z0-9]+ in regular expression notation). Cross-linker modification notations MUST be mentioned once only.
Any annotation made with the symbol # represents a way of linking different locations within the amino acid sequence. In ProForma 2.0 it is used for representing cross-linkers, branched peptides and for grouping protein modifications (including glycans) to represent ambiguity.

Crosslink notation (within the same peptide)
This example shows a DSS crosslink between two lysines:
This second example shows a DSS crosslink between two lysines and an EDC cross-link between two other lysines:
A "dead end" crosslink happens regularly with bifunctional crosslinkers when one side attaches and the other hydrolyses before attaching. These modifications are annotated at only one site.
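The rule that a cross-linker is written once per #XL label, with the other linked site carrying only the label, can be checked mechanically. The following Python sketch is one possible way to do so and is not part of the specification: the function names are hypothetical, nested brackets are not handled, and the strings in the usage note are illustrative only.

```python
import re
from collections import defaultdict

# A bracketed annotation carrying a cross-link label, e.g. [MOD:00034#XL1]
# or the bare reference [#XL1]. Nested brackets are not handled here.
_XL_TAG = re.compile(r"\[([^\[\]#]*)#(XL[A-Za-z0-9]+)\]")


def crosslink_definitions(proforma):
    """Map each #XL label to the descriptors written alongside it.

    A site that only references the label (such as [#XL1]) contributes an
    empty string to the list.
    """
    groups = defaultdict(list)
    for descriptor, label in _XL_TAG.findall(proforma):
        groups[label].append(descriptor)
    return dict(groups)


def labels_defined_once(proforma):
    """Return True if every #XL label has its cross-linker written exactly once."""
    groups = crosslink_definitions(proforma)
    return all(
        sum(1 for descriptor in descriptors if descriptor) == 1
        for descriptors in groups.values()
    )
```

For an illustrative string such as `EVTSEKC[MOD:00034#XL1]LEMSC[#XL1]EFD`, `labels_defined_once` returns `True`; repeating the descriptor at the second site makes it return `False`.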

Representing inter-chain crosslinks
Inter-protein or inter-chain connections are supported using // to separate the crosslinked peptides. This notation is similar to IUPAC condensed notation for inter-protein connections.
It is acknowledged by the authors that more complex scenarios are possible when representing inter-chain crosslinks, including a higher number of linked peptides, directionality, etc. It is envisioned that when these use cases become a clear requirement in the future, a dedicated working group can extend these guidelines.

Representing disulfide linkages
Disulfide bonds may be represented using four possible notations:
(i) Using the PSI-MOD term for "L-cystine (cross link)" (MOD:00034) to explicitly describe the cross-link using the cross-linking notation:
There are more complex examples that are possible. For instance, another example with inter-chain disulfide bonds is insulin:
As mentioned above, more complex scenarios are possible, which will need to be resolved in future versions.
(ii) Using the XLMOD term XLMOD:02009 similarly to case (i) above:
(iii) Using the PSI-MOD term for "half cystine" (MOD:00798) if the pairing is not known. Since the term is only for half the link, it must be specified on all involved sites with no group tag:
(iv) Using the Unimod term for "Dehydro" (UNIMOD:374) to explicitly describe the cross-link using the cross-linking notation.

Representation of branched peptides
Branched peptides can be expressed using the same notation used for representing two cross-linked peptides, but using the term #BRANCH (see above). Examples:
Where a sidechain of a lysine from peptide 1 is linked to the C-terminus of peptide 2 via amidation (-H2O). Taken from: https://www.news-medical.net/whitepaper/20180329/Synthesizing-Unsymmetrically-Branched-Peptides.aspx

Representation of glycans using the GNO ontology as CV
Glycans that are currently included in Unimod or PSI-MOD (individual or very short chains) MAY be represented that way. If the glycans are not included in either PSI-MOD or Unimod, the GNO ontology SHOULD be used. As mentioned above, the use of accession numbers is preferred since accession numbers and names are often the same. There are more complex cases, where ambiguity can be caused by multiple combinations between labile and non-labile glycans attached to the same amino acid sequence. A possible mechanism to represent these more complex cases is available in Section 5 (Pending issues). A further limitation comes from the restricted set of glycans in GNO. We expect that these issues will be solved as the (glyco)proteomics community develops in the near future.

Delta mass notation
In addition to using CV/ontologies names and/or accession numbers, mass differences (delta masses) MAY be used to represent protein modifications.
Delta masses SHOULD only be used when the protein modification cannot be represented using a CV/ontology (e.g., if software does not use ontologies/CVs), when the modification (or combination of modifications) is ambiguous (e.g., coming from open modification searches or de-novo approaches), or when it is unknown. Otherwise, protein modifications SHOULD be represented using Unimod, PSI-MOD, RESID, XL-MOD, or GNO CV parameters.
Mass differences MUST be expressed in Daltons between the coded amino acid and the observed mass. Positive mass shifts MUST be specified with a plus sign. Negative shifts MUST be specified with a minus sign. Monoisotopic masses MUST be used. There are two ways of representing delta masses:
A) Without prefixes. Interpretation of the actual delta masses is then left to the reader software.
B) Using prefixes for CVs/ontologies to provide more information.
If "canonical" delta masses are directly taken from a CV/ontology, the corresponding abbreviation to that CV/ontology MAY be used. The notation also supports the encoding of experimentally observed delta masses. In those cases, the prefix "Obs" MUST be used. The number of significant figures included in the delta mass depends on the accuracy of the available data and SHOULD be used as is by interpreters. Example:

Specifying a gap of known mass
This mechanism can be used to express a gap in the sequence when the number of amino acids in the gap is unknown but the corresponding mass difference is known. This is supported by the use of the character X followed by brackets indicating the total mass of the gap, meaning that the mass of X itself is zero. Example of proper gap notation:

Support for elemental formulas (e.g. for representing small molecular substructures or functional groups)
A modification representing a small molecular substructure or a functional group can be described by a chemical formula. The descriptor "Formula" MUST be used. Only elemental formulas are supported. Example of proper chemical formula usage:
As no widely accepted specification exists for expressing elemental formulas, we have adapted a standard with the following rules (taken from https://github.com/rfellers/chemForma):
Formula Rule 1: A formula is composed of pairs of atoms and their corresponding cardinality (two carbon atoms: C2). Pairs SHOULD be separated by spaces but are not required to be. An atom and its cardinality SHOULD NOT be separated by a space. The Hill system for ordering (https://en.wikipedia.org/wiki/Chemical_formula#Hill_system) is preferred, but not required.
Example: C12H20O2 or C12 H20 O2
Formula Rule 2: Cardinalities must be positive or negative integer values. Zero is not supported. If a cardinality is not included with an atom, it is assumed to be +1.
Example: HN-1O2
Formula Rule 3: Isotopes are handled by prefixing the atom with its isotopic number in square brackets. If no isotope is specified, the previous rules apply and the natural isotopic distribution of the element is assumed.
See Section 5 (Pending issues) for how this mechanism could be extended in the future to support more complex molecular formulas.
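Formula Rules 1 and 2 can be sketched as a small parser. The Python snippet below is an illustration under stated assumptions rather than a reference implementation: the function name `parse_formula` is hypothetical, element symbols are assumed to be one upper-case letter optionally followed by one lower-case letter, and bracketed isotopes (Rule 3) are deliberately out of scope and rejected.

```python
import re

# One atom/cardinality pair: an element symbol and an optional signed integer.
_PAIR = re.compile(r"([A-Z][a-z]?)(-?\d+)?")


def parse_formula(formula):
    """Parse an elemental formula such as 'C12H20O2' or 'HN-1O2'.

    Returns a dict mapping element symbol to summed cardinality.
    Covers Rules 1 and 2 only; isotope prefixes raise ValueError.
    """
    counts = {}
    position = 0
    while position < len(formula):
        if formula[position] == " ":
            # Rule 1: pairs may be separated by spaces.
            position += 1
            continue
        match = _PAIR.match(formula, position)
        if match is None:
            raise ValueError(f"unexpected character at index {position}: {formula!r}")
        element, cardinality = match.groups()
        # Rule 2: a missing cardinality means +1, and zero is not supported.
        count = int(cardinality) if cardinality is not None else 1
        if count == 0:
            raise ValueError(f"zero cardinality is not supported: {formula!r}")
        counts[element] = counts.get(element, 0) + count
        position = match.end()
    return counts
```

Applied to the examples in the rules, `parse_formula("C12 H20 O2")` gives the same result as `parse_formula("C12H20O2")`, and `parse_formula("HN-1O2")` yields one hydrogen, minus one nitrogen and two oxygens.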

Representation of glycan composition
Glycan residues (generic monosaccharides) can be represented using the descriptor "Glycan". If glycan symbols conflict with themselves or element symbols in such a way that ambiguities occur, we will consider requiring spaces between 'atoms' (see Formula Rule #1).

SEQUEN[Glycan:HexNAc1Hex2]CE
The supported list of monosaccharides in ProForma is included below. It is worth noting that the masses and elemental compositions included below for each monosaccharide are those resulting after each of them is condensed with the amino acid chain. We envision that more monosaccharides could be added once this specification document is formalised. An updated list of supported monosaccharides (in two different formats, obo and json) can be found at: https://github.com/HUPO-PSI/ProForma/tree/master/monosaccharides. For other glycans not included there, a new CV term will need to be created, e.g. in PSI-MOD.
It is recognised that this mechanism is limited and can only support the most common glycans. It is envisioned that in the future, when this use case becomes a requirement, a dedicated working group can work in extending these specific guidelines. See Section 5 (Pending issues) for guidance on future extensions of this mechanism to support other macromolecules, e.g. lipids.
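A glycan composition such as the one in SEQUEN[Glycan:HexNAc1Hex2]CE can be split into monosaccharide counts by matching the longest known name first, which avoids reading HexNAc as Hex followed by unparsable text. The Python sketch below is illustrative only: `KNOWN_MONOSACCHARIDES` is a hypothetical subset containing just the names that appear in this document's examples (the authoritative list is the one published on the ProForma GitHub page), and a missing count is assumed to mean one.

```python
import re

# Illustrative subset only: names that appear in this document's examples.
KNOWN_MONOSACCHARIDES = ("Hex", "HexNAc", "NeuAc")


def parse_glycan_composition(composition):
    """Split a composition such as 'HexNAc1Hex2' into {name: count}.

    Assumptions: counts are non-negative integers, a missing count means 1,
    and spaces between entries are tolerated.
    """
    # Try longer names first so that 'HexNAc' is not read as 'Hex'.
    names = sorted(KNOWN_MONOSACCHARIDES, key=len, reverse=True)
    token = re.compile("(" + "|".join(re.escape(n) for n in names) + r")(\d+)?")
    counts = {}
    position = 0
    while position < len(composition):
        if composition[position] == " ":
            position += 1
            continue
        match = token.match(composition, position)
        if match is None:
            raise ValueError(f"unknown monosaccharide at index {position}: {composition!r}")
        name, number = match.groups()
        counts[name] = counts.get(name, 0) + (int(number) if number is not None else 1)
        position = match.end()
    return counts
```

With this subset, `parse_glycan_composition("HexNAc1Hex2")` reports one HexNAc and two Hex.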

Best practices on the use of protein modifications
In the same sequence, the same reference system SHOULD be used to represent the protein modifications. However, the delta mass notation (Section 4.2.5) MAY be combined with the other cases.

N-terminal and C-terminal modifications
The square brackets containing the modification MUST be located before the first amino acid in the sequence or after the last amino acid in the peptide sequence. In both cases, they are separated by a dash (-).
One can also express multiple labile modifications using the following notation:
{Glycan:Hex}{Glycan:NeuAc}EMEVNESPEK
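The terminal and labile notations can be separated from the core sequence with a few anchored patterns. The following Python sketch shows one way a reader might do this; the function name and the returned dictionary layout are hypothetical, nested brackets are not handled, and the terminal example in the usage note is illustrative rather than taken from the specification.

```python
import re

_LABILE = re.compile(r"\{([^{}]*)\}")      # {Glycan:Hex} in front of the sequence
_N_TERM = re.compile(r"^\[([^\[\]]*)\]-")  # [mod]- before the first residue
_C_TERM = re.compile(r"-\[([^\[\]]*)\]$")  # -[mod] after the last residue


def split_prefix_and_termini(proforma):
    """Separate labile and terminal modifications from the rest of the string."""
    labile = []
    rest = proforma
    # Consume every leading {...} group in order.
    while True:
        match = _LABILE.match(rest)
        if match is None:
            break
        labile.append(match.group(1))
        rest = rest[match.end():]
    n_term = None
    c_term = None
    match = _N_TERM.match(rest)
    if match is not None:
        n_term = match.group(1)
        rest = rest[match.end():]
    match = _C_TERM.search(rest)
    if match is not None:
        c_term = match.group(1)
        rest = rest[:match.start()]
    return {"labile": labile, "n_term": n_term, "c_term": c_term, "sequence": rest}
```

For `{Glycan:Hex}{Glycan:NeuAc}EMEVNESPEK` this returns both labile entries and the bare sequence; for an illustrative string such as `[Acetyl]-PEPTIDE-[Amidated]` it returns the two terminal modifications and `PEPTIDE`.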

Support for the representation of ambiguity in the modification position
This notation is used to represent ambiguous modified sites, associated positions and associated probabilities or scores.
This notation is not yet supported for crosslinker modifications (see Section 5.9), except for the case of disulfide cross-linkers which may be represented with ambiguous position using the PSI-MOD term for "half-cystine" (MOD:00798), as noted in Section 4.2.3.3.iii.

Unknown modification position
The positions of some modifications may be unknown. In this case, protein modifications are represented using square brackets that MUST be located on the left side of the amino acid sequence. The symbol '?' is used to indicate that the actual position of the modification is unknown. [

Indicating a possible set of modification positions
The position of a modification may be unknown but belong to a known set of possible sites. In this case, the possible positions for the modifications may be indicated. The rules that MUST be followed are: (i) Groups of possible sites for a modification are represented immediately following the modification notation using the symbol #, followed by an arbitrary label consisting of alphanumeric characters ([A-Za-z0-9]+ in regular expression notation). Note that the label prefix #XL is a special case that MUST be reserved for crosslinkers only.
(ii) A single preferred location for the modification MUST be specified, so that the sequence can be easily rendered in visualization tools. The preferred location for the modification is indicated by the position of the modification notation in the amino acid sequence.
In this example, '#g1' is used as the arbitrary label:

EM[Oxidation]EVT[#g1]S[#g1]ES[Phospho#g1]PEK
This is read as follows: the named group 'g1' indicates that a phosphorylation exists on either T5, S6 or S8, and S8 is the preferred location because the notation 'Phospho' is placed at this position.
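The reading of the example above can be reproduced programmatically. The Python sketch below collects the sites of each labelled group and records the preferred location as the site where the modification name is written. It is a minimal illustration: the function name and output layout are hypothetical, and it assumes a plain sequence with no prefix, terminal or nested-bracket notation.

```python
import re

# A residue letter followed by zero or more bracketed annotations.
_TOKEN = re.compile(r"([A-Za-z])((?:\[[^\[\]]*\])*)")
_BRACKET = re.compile(r"\[([^\[\]]*)\]")


def ambiguity_groups(proforma):
    """Return {label: {'modification', 'preferred', 'sites'}} for '#label' groups."""
    groups = {}
    for position, token in enumerate(_TOKEN.finditer(proforma), start=1):
        residue = token.group(1).upper()
        for content in _BRACKET.findall(token.group(2)):
            if "#" not in content:
                continue  # an ordinary, localised modification
            name, label = content.split("#", 1)
            if label.upper().startswith("XL"):
                continue  # the #XL prefix is reserved for cross-linkers
            group = groups.setdefault(
                label, {"modification": None, "preferred": None, "sites": []}
            )
            site = f"{residue}{position}"
            group["sites"].append(site)
            if name:
                # The site carrying the modification name is the preferred one.
                group["modification"] = name
                group["preferred"] = site
    return groups
```

For `EM[Oxidation]EVT[#g1]S[#g1]ES[Phospho#g1]PEK` the result contains a single group `g1` with modification `Phospho`, sites `T5`, `S6` and `S8`, and preferred site `S8`.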
The following example is not valid because a single preferred location must be chosen for a modification:
The caret symbol (^), which can be used to represent multiple instances of the same unlocalised modification before the N-terminal end of the amino acid sequence (Section 4.4.1), is not allowed within the amino acid sequence.

Representing ranges of positions for the modifications
Overlapping ranges represent a more complex case and are not yet supported, and so, the following example would NOT be valid:

Indicating modification position preference and localisation scores
There are two options to represent this type of information. The values of the modification localisation scores can be indicated in parentheses within the same group and brackets. Scores for the modification position can be expressed as probabilities and/or FLR (False Localisation Rate), but the actual meaning of the scores is not reported. The preferred location of the modification notation reflects the value of the scores. If there is a tie in the value of the localisation scores, one preferred position needs to be chosen by the writer.
An additional option to represent localisation scores is to leave the position of the modification as unknown using the '?' notation but report the localisation modification scores at specific sites.
Example of proper usage of localisation scores with unknown modification site notation:

Representation of multiple modifications in the same amino acid residue
It is possible to represent two or more modifications on the same amino acid or group of amino acids. The caret symbol (^), which can be used to represent multiple instances of the same unlocalised modification before the N-terminal end of the amino acid sequence (Section 4.4.1), is not allowed within the amino acid sequence. Currently, complex glycans are not explicitly supported (see Section 3.4). An alternative solution in those rare cases not involving glycans is to have a single PSI-MOD/Unimod entry for the combination of modifications, which would need to be created in advance, if not yet available.

Representation of global modifications
This mechanism MAY be used for modifications that apply to all relevant residues in the peptide/protein amino acid sequence. These modifications MAY be represented by the use of the characters "<" and ">" on the left side of the sequences. A couple of use cases are envisioned:

Use Case 1: Representation of isotopes
This might be used in the case of synthetic peptides with 100% incorporation.
Examples, where the isotope applies to all residues:
Carbon 13: <13C>ATPEILTVNSIGQLK
Nitrogen 15: <15N>ATPEILTVNSIGQLK
Deuterium: <D>ATPEILTVNSIGQLK
The representation of multiple isotopes is also possible. They can be located in any order.

Both Carbon 13 and Nitrogen 15: <13C><15N>ATPEILTVNSIGQLK
Distributions of isotope masses could be supported in future work.
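Because global tags sit to the left of the sequence and may appear in any order, a reader can strip them off one by one and compare them as a set. The following Python sketch illustrates this; both function names are hypothetical and nested angle brackets are not handled.

```python
import re

_GLOBAL_TAG = re.compile(r"<([^<>]*)>")


def split_global_modifications(proforma):
    """Return (list of leading <...> tag contents, remainder of the string)."""
    tags = []
    rest = proforma
    while True:
        match = _GLOBAL_TAG.match(rest)
        if match is None:
            break
        tags.append(match.group(1))
        rest = rest[match.end():]
    return tags, rest


def same_global_labelling(first, second):
    """Compare two strings while ignoring the order of their global tags."""
    tags_a, rest_a = split_global_modifications(first)
    tags_b, rest_b = split_global_modifications(second)
    return sorted(tags_a) == sorted(tags_b) and rest_a == rest_b
```

Under this sketch, `<13C><15N>ATPEILTVNSIGQLK` and `<15N><13C>ATPEILTVNSIGQLK` are treated as the same labelling.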

Use Case 2: Fixed protein modifications
This mechanism can be useful especially in the case of full proteoforms. The affected amino acid MUST be indicated using @. If more than one residue is affected, the residues MUST be comma separated. Examples:
<[S-carboxamidomethyl-L-cysteine]@C>ATPEILTCNSIGCLK
<[MOD:01090]@C>ATPEILTCNSIGCLK
<[Oxidation]@C,M>MTPEILTCNSIGCLK
Fixed modifications MUST be written prior to ambiguous and labile modifications, and similar to ambiguity notation, N-terminal modifications MUST be the last ones written, just next to the sequence. The following examples would be valid:

Representation of amino acid sequence ambiguity
Ambiguity in the amino acid sequence needs to be represented in some cases, e.g. to represent sequence changes that do not change the mass of the peptidoform/proteoform, but are not known. One concrete example is the need to encode the results of de novo sequencing tools. The way to encode this information is to use parentheses and a question mark, enclosing the ambiguous sequence written in a preferred way. Examples:

General information or comments can be encoded using the 'info' tag like:
ELV[INFO:AnyString]IS
ELV[info:AnyString]IS
The information represented in an 'info' tag is considered non-standard (e.g. any text besides unpaired brackets) and does not need to be parsed.
"Info" tags can be split using the pipe character. Example of proper 'info' tag usage:

Support for the joint representation of experimental data and its interpretation
The pipe character "|" is used to represent protein modifications simultaneously with CV/ontology names and/or accession numbers, and delta masses. As explained in Section 4.2.6, Delta mass notation, it is possible to represent both canonical delta masses and experimental observations, allowing the representation of both interpretation (using CV/ontology names/accession numbers) and experimental observations (delta masses). Ambiguous cases are also allowed because they can be used to represent "comparable" information.

ELVIS[Obs:+79.966|Phospho|Sulfo]K
Highly different modifications SHOULD NOT be joined, as it would be difficult for readers to interpret them correctly. It is however acknowledged that readers can choose to implement the parsing in different ways. Some tools may always take CV terms, others could take delta masses, and so on.
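A reader that wants to choose between the interpretation and the observation first has to split the bracket content on the pipe character and tell delta masses apart from names or accession numbers. The Python sketch below shows one possible approach; the function name and the output dictionaries are hypothetical. The mass is kept as text alongside its numeric value so that the reported number of significant figures is preserved.

```python
import re

# An optional alphabetic prefix (such as 'Obs') followed by a signed decimal number.
_DELTA_MASS = re.compile(r"^(?:([A-Za-z]+):)?([+-]\d+(?:\.\d+)?)$")


def classify_descriptors(bracket_content):
    """Split pipe-joined descriptors and label each as a delta mass or a name."""
    descriptors = []
    for part in bracket_content.split("|"):
        part = part.strip()
        match = _DELTA_MASS.match(part)
        if match is not None:
            prefix, mass_text = match.groups()
            descriptors.append(
                {
                    "kind": "delta_mass",
                    "source": prefix,      # None when no prefix is given
                    "text": mass_text,     # as written, significant figures intact
                    "value": float(mass_text),
                }
            )
        else:
            descriptors.append({"kind": "name_or_accession", "text": part})
    return descriptors
```

For the bracket content of `ELVIS[Obs:+79.966|Phospho|Sulfo]K`, the first entry is classified as an observed delta mass and the other two as names. A tool that prefers CV terms would pick the latter, while one that prefers experimental values would pick the former.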

Pending Issues -Future developments
Additionally, there are several use cases that are NOT currently supported in the current version of the specification. These complications are left open in version 2.0 of the specification and will ideally be addressed in future versions, after the community has gained more experience with the common cases. The objective here is to document those cases appropriately and propose some possible solutions for representing the information in future versions of ProForma.

Representation of cyclic peptides
Cyclic peptides are currently supported only if they can be represented using the supported CVs/ontologies for protein modifications. The following examples represent possible ways to represent cyclic peptides, but these solutions need to be formalised and the PSI-MOD modifications created.
where MOD:nnnnnn would be a new PSI-MOD term to represent backbone cyclisation involving the amidation between a C-terminal carboxylate and an N-terminal amine, with a mass difference of O-1H-2 (-18 Da).
2) Cyclic peptide with C- and N-termini bound together at the peptide backbone level with 3 disulfide bonds: Retrocyclin 1 (PubChem ID 16130540). The exact structure is the following: https://pubchem.ncbi.nlm.nih.gov/compound/Retrocyclin-1#section=Biologic-
where MOD:nnnnnn would be a new PSI-MOD term to represent backbone cyclisation involving the amidation between a C-terminal carboxylate and an N-terminal amine, with a mass difference of O-1H-2 (-18 Da).

Representation of ambiguity when different glycans are attached to the same amino acid sequence
Multiply glycosylated peptides, especially under vibrational/collisional dissociation, may fragment in ways that allow sequencing the peptide backbone without completely characterizing the glycan sites. Instead, only the aggregate composition can be determined based on the precursor peptide mass. In such cases, the glycosylation sites may be known only by motif (N-glycans), or there may be several possible sites. Alternatively, the total number of glycosylation sites may be unknown (O-glycans), with the aggregate glycan composition spread across positions in unknown proportions.
There is a need to express that a site is a possible glycosylation site, as well as a mechanism to express the total glycan composition shared across these sites. The latter is achieved by using a labile modification to prefix the total composition. There are multiple proposals for expressing putative site assignment:
Proposal 1: The peptide hosts two N-glycans, where the glycan class is known from the required motifs on the sequence, and it is known to be multiply glycosylated because no single N-glycan with the aggregate composition is biosynthetically feasible. This proposal denotes the inferred glycosylation sites using the PSI-MOD "N-glycosylated residue" term. This forces the reader to treat this group differently: the modification is inferred to be the labile glycan modification, and it may be split amongst the sites, assigning zero or more monosaccharides to each group position.
Pros:
• Conveys extra metadata about the glycan type
• Uses an existing term
Cons:
• Introduces new semantics for a modification that is not explicitly conveyed notationally, namely that this modification is not observable, but just encodes positional information.
• For complex and ambiguous O-glycopeptides, this method would pull double-duty with ambiguity notation.
Proposal 2: The same case as in Proposal 1, but instead of adding extra baggage to an existing term, this proposal uses ambiguity groups to denote possible positions and marks one group with a new "Glycan" key, which adds the same labile modification inference step.
Proposal 3: The only differentiating feature of this proposal from Proposal 2 is that it isolates the notational change solely within the Glycan tag handling, which reduces the burden on implementers who do not want to support glycosylation.

Representation of rare amino acids not supported by the one letter code
This use case is currently not supported. Rare amino acids SHOULD instead be handled through their representations in one of the supported ontologies/CVs.

Representation of average masses
During the development of the format, it was acknowledged that, in top-down proteomics approaches, there could be cases where monoisotopic masses are unknown and average masses need to be used instead. At the moment, monoisotopic masses are the only ones formally allowed, but this MAY have to change in future versions.
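The distinction between the two mass types can be made concrete with a small calculation. The sketch below assumes commonly used monoisotopic masses and approximate standard atomic weights; the tables and the helper name mass are illustrative and are not defined by the specification.

```python
# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "S": 31.97207100,
}
# Approximate average atomic masses (Da), weighted by natural abundance.
AVERAGE = {"C": 12.0107, "H": 1.00794, "N": 14.0067, "O": 15.9994, "S": 32.065}


def mass(composition, table):
    """Mass of an elemental composition using the given mass table."""
    return sum(table[element] * n for element, n in composition.items())


# Glycine residue, C2H3NO, as a small worked example.
glycine_residue = {"C": 2, "H": 3, "N": 1, "O": 1}
mono = mass(glycine_residue, MONOISOTOPIC)
avg = mass(glycine_residue, AVERAGE)
print(f"monoisotopic {mono:.4f} Da, average {avg:.4f} Da, difference {avg - mono:.4f} Da")
```

The difference is small for a single residue but accumulates with the number of atoms, which is why it matters for intact proteins in top-down experiments.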

Representation of lipids
These SHOULD be handled through their representations in one of the supported ontologies/CVs. However, a similar mechanism to the one described in Section 4.2.8, Representation of glycan composition, could be implemented for lipid molecules. It is envisioned that when this use case becomes a clear requirement in the future, a dedicated working group can extend these specific guidelines.

Distribution of isotopes in the sequence
The representation of the distributions of isotopes for global modifications (Section 4.6) is not supported in the current version of the specification. A mechanism will need to be envisioned to support this use case in future versions.

Representation of molecular formula
Elemental formulas are supported by the current version of ProForma (Section 4.2.7), and molecular formulas may be supported in the future if this proves helpful, for example to specify branching in a PTM structure. A molecular formula may include repeated (condensed) sections using parentheses and an extra cardinality.
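To illustrate what parentheses with an extra cardinality would imply for implementers, the following sketch expands a condensed formula into plain element counts. The syntax accepted here (element symbols with optional counts, and parenthesised groups followed by an optional repeat number) is an assumption made for illustration only, since no such syntax is defined in this version of the specification.

```python
import re
from collections import Counter

# Element with optional count, opening parenthesis, or closing
# parenthesis with optional cardinality (hypothetical syntax).
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)|(\()|(\))(\d*)")


def expand_formula(formula):
    """Expand a condensed formula such as 'CH3(CH2)2OH' into element counts."""
    stack = [Counter()]
    pos = 0
    while pos < len(formula):
        match = TOKEN.match(formula, pos)
        if match is None:
            raise ValueError(f"unexpected character at {pos}: {formula[pos]!r}")
        element, count, open_paren, close_paren, repeat = match.groups()
        if element:
            stack[-1][element] += int(count or 1)
        elif open_paren:
            stack.append(Counter())  # start collecting a repeated section
        else:
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            group = stack.pop()
            for el, n in group.items():
                stack[-1][el] += n * int(repeat or 1)
        pos = match.end()
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return dict(stack[0])


print(expand_formula("CH3(CH2)2OH"))
```

Expanding to element counts in this way would let a molecular formula reuse the mass calculation already needed for elemental formulas.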

Representation of an overlapping range of possible modification positions
The notation for ambiguous localization currently supports only non-overlapping ranges. A possible representation of overlapping ranges, which may be considered in the future, uses a grouping tag for both parentheses.
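Until overlapping ranges are supported, an implementation may wish to detect them before writing a ProForma string. The following is a minimal sketch, assuming that ranges are held as hypothetical (start, end) tuples of 1-based inclusive residue positions; this representation is not prescribed by the specification.

```python
def overlapping_pairs(ranges):
    """Return index pairs of (start, end) ranges that share at least one position."""
    pairs = []
    for i in range(len(ranges)):
        for j in range(i + 1, len(ranges)):
            a_start, a_end = ranges[i]
            b_start, b_end = ranges[j]
            # Two inclusive ranges overlap when each starts before the other ends.
            if a_start <= b_end and b_start <= a_end:
                pairs.append((i, j))
    return pairs


# Positions 6 to 8 belong to both of the first two ranges.
print(overlapping_pairs([(3, 8), (6, 12), (14, 15)]))
```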

Representation of ambiguous crosslinker modification positions
Notation for ambiguous crosslinker modification positions is not supported in this version of ProForma but may be supported in the future.

Metadata related to ProForma entries
At present, metadata related to ProForma entries (e.g. date of creation, software, version of ontology used, etc.) cannot be provided in a standardised manner. The creation of an additional metadata file could be considered in future versions.

Representation of sequences coming from non-mass spectrometry-based proteomics approaches
The ProForma notation could be made compatible with non-mass spectrometry proteomics approaches, such as nanopore and Edman-based sequencing, and others that will face the same notation challenges. A mechanism will need to be envisioned to support these use cases in future versions.

Appendix I. Levels of Compliance
Because this specification supports multiple use cases, not all implementers are expected to support every feature of ProForma version 2.
To facilitate adoption and separate some of the use cases, there are multiple "levels of compliance" and extensions for ProForma. The technical name of each level of compliance is indicated in parentheses to enable the labelling of future software.
1) Base Level Support (Technical name: Base-ProForma Compliant)
Represents the lowest level of compliance. This level involves providing support for:
- Amino acid sequences.
- Protein modifications using two of the supported CVs/ontologies: Unimod and PSI-MOD.
- Protein modifications using delta masses (without prefixes).
- N-terminal, C-terminal and labile modifications.
- Ambiguity in the modification position, including support for localisation scores.
- Ambiguity in the amino acid sequence.
2) Additional Separate Support (Technical name: level 2-ProForma compliant)
These features are independent from each other:
- Unusual amino acids (O and U).
- Ambiguous amino acids (e.g. X, B, Z). This would include support for sequence tags of known mass (using the character X).
- Protein modifications using delta masses (using prefixes for the different CVs/ontologies).
- Support for the joint representation of experimental data and its interpretation.
3) Top-Down Extensions (Technical name: level 2-ProForma + top-down compliant)
- Additional CV/ontologies for protein modifications: RESID (the prefix R MUST be used for RESID CV/ontology term names).
- Chemical formulas (this feature occurs in two places in this list).
- Chemical formulas (this feature occurs in two places in this list).
6) Spectral Support (Technical name: level 2-ProForma + mass spectrum compliant)
- Charge and chimeric spectra are special cases (see Appendix II).
- Global modifications (e.g., every C is C13).
If an implementation supports more than one extension, the supported extensions can be indicated together, separated by '+'. Example: (level 2-ProForma + cross-linking + mass spectrum compliant).
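Software that reports its own compliance could assemble the technical name from the extensions it supports. The sketch below follows the '+'-separated pattern of the example above; the function name compliance_label is hypothetical, and the extension names are taken as plain strings supplied by the caller.

```python
def compliance_label(extensions=()):
    """Build a technical compliance name from a list of extension names."""
    parts = ["level 2-ProForma", *extensions]
    return " + ".join(parts) + " compliant"


print(compliance_label(["cross-linking", "mass spectrum"]))
```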
Additionally, see Section 5 "Pending Issues - Future developments" for features not yet formally supported in this version of the specification. In the future, there could be additional extensions, e.g., for lipid molecules.

Appendix II: Extensions to improve the representation of PSMs in mass spectra
This appendix is not relevant for the representation of peptidoforms and proteoforms, but rather presents techniques for representing PSMs (that is, peptidoforms and proteoforms together with mass spectra).

Representation of the ion charges
The charge value MAY optionally be indicated at the C-terminal end of the amino acid sequence, using the forward slash (/).
By default, a positive number n implies a molecular ion that is n-times protonated. SEQUENCE/2 means [SEQUENCE(neutral) + 2 protons] and is doubly charged: [M+2H+]2+
By default, a negative number n implies a molecular ion that is n-times deprotonated. SEQUENCE/-2 means [SEQUENCE(neutral) - 2 protons] and is doubly charged: [M-2H+]2-
When the charge derives from the addition or the removal of another ion, this ionic species SHOULD be provided after the charge state number. Examples include a Na+ adduct, the addition of one electron, the removal of an OH-, the addition of an iodine ion, and a radicalisation.
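The default (de)protonation rule translates directly into an m/z calculation. The following is a minimal sketch, assuming a known neutral monoisotopic mass and the default proton-based charging; the function names are hypothetical, and ionic species given after the charge state number are deliberately not handled.

```python
import re

PROTON_MASS = 1.007276466812  # Da

# A trailing '/<integer>' is read as the charge; the '//' that separates
# cross-linked or branched peptides is not followed by an integer at the end.
CHARGE_SUFFIX = re.compile(r"(.+)/(-?\d+)")


def split_charge(entry):
    """Split 'SEQUENCE/z' into (sequence, z); z is None when no charge is given."""
    match = CHARGE_SUFFIX.fullmatch(entry)
    if match is None:
        return entry, None
    return match.group(1), int(match.group(2))


def mz(neutral_mass, charge):
    """m/z of an ion formed by adding (z > 0) or removing (z < 0) |z| protons."""
    if charge == 0:
        raise ValueError("charge must be non-zero")
    return (neutral_mass + charge * PROTON_MASS) / abs(charge)


sequence, charge = split_charge("SEQUENCE/2")
print(sequence, charge, round(mz(1000.0, charge), 4))
```

For a neutral mass of 1000 Da, the doubly protonated ion falls near m/z 501.0073 and the doubly deprotonated ion near m/z 498.9927.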

Appendix III. Glossary of terms used in the specification
The objective here is to provide a list of the keys used in the document, so that a summary view is available for implementers.

5-Cross-Linked Peptides and branched peptides
Using the XL-MOD CV, crosslinked sites MUST be represented immediately after the modification notation using the prefix #xl, followed by an arbitrary label consisting of alphanumeric characters ([A-Za-z0-9]+ in regular expression notation). Cross-linker modification notations MUST be mentioned once only. As mentioned above, any annotation made with the symbol # represents a way of linking different locations within the amino acid sequence. In ProForma 2.0 it is used for representing cross-linkers, and for grouping protein modifications (including glycans) to represent ambiguity. Branched peptides can be expressed using the same notation used for representing two cross-linked peptides, but using the term #BRANCH.

a) ETFGD[MOD:00093#BRANCH]//R[#BRANCH]ATER
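Because the # symbol links different locations within a sequence, a reader can recover the linked sites by collecting the labels that follow it. The sketch below is a simplified illustration only: the regular expression and the function name link_labels are assumptions, and additional annotations that may follow a label are not interpreted.

```python
import re
from collections import Counter

# A square-bracketed annotation containing '#' followed by an alphanumeric label.
TAGGED_BRACKET = re.compile(r"\[[^\[\]]*#([A-Za-z0-9]+)[^\[\]]*\]")


def link_labels(entry):
    """Count how many bracketed annotations carry each '#' label."""
    return dict(Counter(m.group(1) for m in TAGGED_BRACKET.finditer(entry)))


print(link_labels("ETFGD[MOD:00093#BRANCH]//R[#BRANCH]ATER"))
```

In the branched peptide example, the label BRANCH is found twice: once with the modification that defines the branch and once at the position it links to.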

6-Representation of ambiguity in the amino acid sequence
This mechanism can be used to represent changes that do not alter the mass of the peptidoform/proteoform but are not known. The way to encode this information is to use a parenthesis and a question mark, enclosing the ambiguous sequence written in a preferred way.

Intellectual Property Statement
The PSI takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the PSI Chair.
The PSI invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this recommendation. Please address the information to the PSI Chair (see contacts information at PSI website).

Copyright Notice
Copyright (C) Proteomics Standards Initiative (2022). All Rights Reserved.
This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the PSI or other organizations, except as needed for the purpose of developing Proteomics Recommendations in which case the procedures for copyrights defined in the PSI Document process must be followed, or as required to translate it into languages other than English.
The limited permissions granted above are perpetual and will not be revoked by the PSI or its successors or assigns.